(57) The method serves to automatically identify in
a set of data sequences at least one specific type of
information contained in each data sequence of the set,
wherein the type of information has an unknown pres- 
entation in the data sequences. It comprises the steps 
of: 

- initially defining at least one characteristic feature
of the specific type of information, and of expressing
the characteristic feature(s) in terms of at least one
recognition rule executable by processor means (2),

- applying the recognition rule(s) through the processor
means to analyse the set of data sequences,

- determining in each data sequence a data portion
thereof satisfying the recognition rule(s), and

- identifying the data portion as corresponding to the
specific type of information.

The invention can be used notably for automatically 
processing the contents of music file names, where the 
data sequence corresponds to the characters for a mu- 
sic file, and the specific information types are an artist 
name and/or music title contained in some arbitrary form 
and order in the file name.
Description

[0001] The present invention relates to a method and
apparatus for automatically identifying at least one spe- 
cific type of information contained in a data sequence. 
The data sequence can correspond e.g. to the charac- 
ters forming a file name attributed to a music file or other 
form of computer file. In the case of a music file (some- 
times also known as an audio file), the specific type of 
information in question can be an artist name and/or a 
music title contained in the character sequence forming 
the file name. 

[0002] Such an automatic information identification 
can be used for managing large sets of music files located
on personal storage media, such as hard disks,
CD-ROMs, DVD-ROMs, minidisks, etc. The information
thus extracted can be used in various applications in areas
of sorting, archiving, computer assisted music title
compilation and playlist generation, etc.
[0003] A music file is generally a data module contain- 
ing binary data that encodes recorded music corre- 
sponding to a music title. The data can be read from the 
file and processed to produce an audio output exploita- 
ble by a computer or suitable sound reproduction sys- 
tem. Music files are generally handled and managed like 
other computer files, and have arbitrarily chosen file 
names which serve to indicate the associated audio 
content, usually a music title and artist. For instance the 
file name can be made to indicate the artist and the song 
or album corresponding to the audio contents. The au- 
dio file will typically also have an extension (part appear- 
ing just after a dot) indicating the music format, normally 
a compression protocol such as mp3, wav, or the like. 
File names can be given by music distributors, or by end 
users who create their own audio files. 
[0004] There is nowadays a rapidly growing number
of users who create and store vast collections of such
audio files (over one thousand) on personal storage media,
typically computer hard disks and writable CDs.
The music files of a collection can have different origins:
personal CD collections, files downloaded from internet
sites, such as those which sell music titles online, CDDB,
radio recordings, etc.

[0005] At present, there is no standardised format for 
naming files, either in terms of syntax or in terms of artist 
name and title. In particular, users are normally confront- 
ed with disparate titling formats in which the order and 
form of the identification information can vary from one 
title to another. This lack of uniformity is clearly apparent 
when consulting lists of audio files presented at random 
from different users, e.g. in the internet sites which sell 
music titles online. 

[0006] Some recording formats such as mp3 include 
so-called metadata which serves to identify the artist 
and title, but again no set rule is established stating how 
that information is to be organised. Likewise, there is no 
universal coding system for artist names or songs or 
track titles. For example, the pop group "The Beatles"
will appear in some catalogues under "Beatles", while
in others under "The Beatles", or again "Beatles, The".
Similarly, the lack of universal coding of music title file
names is also a source of problems, especially when
dealing with lengthy and complex title names. In particular,
there is no rule regarding the order of mention of
the artist and music title in a file name.
[0007] There then arises a problem of distinguishing
the artist from the music title contained in a music file
name, starting from the fact that the file name can be
expected to contain that information in some form,
possibly with abbreviations.

[0008] This distinguishing task is normally easy for a
human being, whose cognitive and thinking processes
are well suited to such recognition and sorting tasks.
Nevertheless, it quickly becomes tedious when having to
manage vast collections of audio files, e.g. of over a
thousand titles and possibly many more.
[0009] Moreover, a manual identification does not in
itself allow the useful information to be passed on to a
music title management system without some additional
human intervention. Such a manual approach would
thus defeat the object of creating a fully automated and
flexible system.

[0010] In view of the foregoing, a first object of the
invention is to provide a method of automatically identifying
in a set of data sequences at least one specific type
of information contained in each data sequence of the
set, wherein the type of information has an unknown
presentation in the data sequences, characterised in
that it comprises the steps of:

- initially defining at least one characteristic feature
of the specific type of information, and of expressing
the characteristic feature(s) in terms of at least one
recognition rule executable by processor means,
- applying the recognition rule(s) through the processor
means to analyse the set of data sequences,
- determining in each data sequence a data portion
thereof satisfying the recognition rule(s), and
- identifying the data portion as corresponding to the
specific type of information.

[0011] It can be appreciated that the invention effectively
forms an automated means for extracting items of
information from a source in which those items are not
expressed in a rigorous manner, or are presented in a
manner which is not known a priori at the level of the means
performing the automatic identification. In this respect,
the invention can be seen as a means for extracting features
or rules from a system of information where those
features or rules are not identified or labelled by that
system.

[0012] Thus, in the context of names attributed to music
files, the invention makes it possible to recognise
automatically an artist name and a music title when these
items of information are not expressed in the filename
with rigour or according to a universal protocol.
[0013] The determining step can comprise a sub-step
of picking out from the data sequence different data portions
corresponding to respective types of information
and applying the recognition rule(s) to each picked-out
data portion.

[0014] One recognition rule can instruct to identify the 
specific type of information in terms of frequency of oc- 
currence of a data portion over the set of data sequenc- 
es. 

[0015] Thus, the determining step further comprises 
the sub-steps of: 

- determining relative positions of the different data 
portions within a data sequence, 

- comparing, over the set of data sequences, data 
portions occupying the same relative position in the 
data sequence, and 

- determining from the comparison the relative position
where there is the greatest occurrence of identical
data portions over the set of data sequences,

and wherein the step then involves identifying the data 
portion located at the relative position of greatest occur- 
rence as corresponding to the specific type of informa- 
tion. 
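
By way of illustration only, the frequency-of-occurrence rule of the preceding paragraph might be sketched in Python as follows. The function name, the default separator and the assumption that the set of data sequences is non-empty and already split by a known separator are hypothetical and do not come from the description.

```python
from collections import Counter


def most_repeated_position(sequences, separator="-"):
    """Return the relative position whose contents recur most over the set.

    Hypothetical helper: assumes a non-empty set and a known separator.
    """
    # Split every data sequence into data portions, trimming blanks.
    rows = [[part.strip() for part in seq.split(separator)] for seq in sequences]
    # Only positions present in every sequence can be compared.
    width = min(len(row) for row in rows)

    best_index, best_count = 0, 0
    for index in range(width):
        # Count identical data portions found at this relative position.
        counts = Counter(row[index] for row in rows)
        top = counts.most_common(1)[0][1]
        if top > best_count:
            best_index, best_count = index, top
    return best_index
```

With the set "THE BEATLES - ELEANOR RIGBY", "THE BEATLES - YESTERDAY", "THE BEATLES - HELP", the sketch returns position 0, i.e. the position at which the same data portion recurs most often.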

[0016] Another recognition rule can instruct to identify
the specific type of information in terms of the size
of a data portion of the data sequence, and/or instruct
to identify the type of information in terms of a relative
position of a data portion in the data sequence.
[0017] The determining step can comprise the follow- 
ing sub-steps, applied to at least some of the data se- 
quences of the set: 

determining a candidate data portion in a data se- 
quence, and 

comparing the candidate data portion against a 
stored set of data portions known to correspond to 
the specific type of information to be identified, 

wherein the identifying step involves identifying 
the data portion found to be present in the data base as 
corresponding to the specific type of information. 
[0018] There can be provided a step, prior to the de- 
termining step, of normalising the data sequence by re- 
moving from the data sequence data not susceptible of 
being contained in the specific type of information to be 
identified. 

[0019] There can also be provided a step, prior to the 
determining step, of identifying in the data sequence 
separator data separating different data portions there- 
in, by reference to a stored set of possible separator 
characters. 

[0020] The data sequence corresponds to characters 
forming a file name of a computer file. 
[0021] In the embodiment, the set of data sequences
corresponds to a respective set of file names of music
files, each data sequence being the characters forming
a corresponding music file name, and a data portion being
a character field containing information of a given
type, and the specific type of information to be identified
comprises at least one of:

a first type of information corresponding to an artist 
name contained in the music file name, and 
a second type of information corresponding to a mu- 
sic title name contained in the music file name. 

[0022] In this case, the method can further comprise 
a step, prior to the determining step, of determining a 
separator character present between character fields 
respectively assigned to the first and second types of 
information. 

[0023] Preferably, the separator character is inferred 
as being: i) neither a digit, nor a letter, nor a space, and 
ii) present the same number of times in all file names 
excluding starting and ending positions. 
[0024] There can be provided a further step of detect- 
ing the presence of a character cluster composed of a 
first part which is constant and a second part which is 
variable over the set of music file names, the second 
part being e.g. an integer or equivalent count character,
and of eliminating that character cluster from the character
sequence.

[0025] A recognition rule in the context of music files 
can instruct to identify the first type of information as 
contained in the character field forming the most words 
among character fields assigned to respective types of 
information, and/or as contained in the character field 
which has the most occurrence in identical form in the 
set of music file names, and/or as contained in the character
field matching a character field in a set of stored
character fields corresponding to artist names, and/or as
contained in the first character field appearing in the music
file name.

[0026] The determining and identifying steps can in- 
volve the sub-steps of: 

identifying in the characters forming the music file
name a first character field and a second character
field, one field containing the first type of information
(artist name) and the other containing the
second type of information (music title name),
determining, by reference to an artist database con- 
taining character fields each corresponding to a re- 
spective artist name, a first value corresponding to 
the number of occurrences, over the set of music 
file names, of a first character field contained in the 
artist database, and a second value corresponding 
to the number of occurrences, over the set of music 
file names, of a second character field contained in 
the artist database, wherein 
if the first value is greater than the second value, 
identifying the first character field as corresponding 
to an artist name, 

if the second value is greater than the first value,
identifying the second character field as corresponding
to an artist name,
if the first and second values are equal, continuing
by:

determining a new first value corresponding to
the number of different contents of the first
character field over the set of music file names
and a new second value corresponding to the
number of different contents of the second
character field over the set of music file names,
wherein

if the first value is greater than the second val- 
ue, identifying the second character field as 
corresponding to an artist name, 
if the second value is greater than the first
value, identifying the first character field as
corresponding to an artist name,
if the first and second values are equal, contin- 
uing by: 

determining a new first value correspond- 
ing to the total number of words in the first 
character field summed over the entire set 
of music file names and a new second val- 
ue corresponding to the total number of 
words in the second character field 
summed over the entire set of music file 
names, wherein 
if the first value is greater than the second
value, identifying the first character field as
corresponding to an artist name,
if the second value is greater than the first
value, identifying the second character
field as corresponding to an artist name,
and

if the first and second values are equal,
identifying the first character field as
corresponding to an artist name.
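
Purely as a sketch, the three successive comparisons set out above (database hits, then number of different contents, then total word count) could be chained as follows. The function name, the representation of the file names as pairs of fields and the use of a Python set for the artist database are assumptions made for this illustration.

```python
def artist_field_index(pairs, artist_db):
    """Return 0 if the first character field holds the artist name, else 1.

    pairs: list of (first_field, second_field) tuples, one per file name.
    artist_db: set of known artist names (may be empty). Both are
    hypothetical representations chosen for this sketch.
    """
    firsts = [first for first, _ in pairs]
    seconds = [second for _, second in pairs]

    # Test 1: number of fields found in the artist database.
    a = sum(field in artist_db for field in firsts)
    b = sum(field in artist_db for field in seconds)
    if a != b:
        return 0 if a > b else 1

    # Test 2: number of different contents. The field that varies
    # more is taken as the title, so the other one is the artist.
    a, b = len(set(firsts)), len(set(seconds))
    if a != b:
        return 1 if a > b else 0

    # Test 3: total number of words summed over the whole set.
    a = sum(len(field.split()) for field in firsts)
    b = sum(len(field.split()) for field in seconds)
    if b > a:
        return 1
    # First field wins when it has more words, and also on a final tie.
    return 0
```

Each test is consulted only when the previous one is inconclusive, which mirrors the nesting of the sub-steps; the final tie defaults to the first character field.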

[0027] There can further be comprised the step of ap- 
plying rewriting rules to at least one of an artist name 
and a music title name identified from a music file name, 
the rewriting rules being executable by the processor 
means for transforming an artist name/music title name 
into a form corresponding to that used for storing artist 
names/music title names in a database. 
[0028] The method may also comprise a step of com- 
piling a directory of rewritten music file names, corre- 
sponding to the identified music file names, in which at 
least one of an artist name and a music title name is 
organised to be machine readable. 
[0029] It may further comprise the step of constructing 
for each music file name a machine readable informa- 
tion module comprising at least an identified artist name 
and an identified music title name, to which is associated 
metadata, the metadata being provided from a database 
on the basis of the identified artist name and/or music
title name.

[0030] The metadata can be indicative of a genre or 
genre/subgenre associated with the corresponding mu- 
sic title. 

[0031] A second object of the invention concerns the
use of the above method in a music playlist generator, 
wherein the playlist generator accesses stored music 
files by reference to identified artist names and/or iden- 
tified music title names. 

[0032] A third object of the invention concerns an ap- 
paratus for automatically identifying in a set of data se- 
quences at least one specific type of information con- 
tained in each data sequence of the set, wherein the 
type of information has an unknown presentation in the 
data sequences, characterised in that it comprises:

- means for expressing at least one characteristic
feature of the specific type of information, and for
expressing the characteristic feature(s) in terms of
at least one machine executable recognition rule,
- processor means for applying the recognition rule(s)
to analyse the set of data sequences,
- determining means for determining in each data sequence
a data portion thereof satisfying the recognition
rule(s), and
- identifying means for identifying the data portion as
corresponding to the specific type of information.

[0033] The optional aspects of the method defined
above apply mutatis mutandis to that apparatus.
[0034] A fourth object of the invention concerns a sys- 
tem combining the above apparatus with a music playlist 
generator, wherein the playlist generator accesses 
stored music files by reference to identified artist names 
and/or identified music title names. 
[0035] The invention can thus provide automated 
means for identifying items of information expressed in 
file names - or more generally in a data sequence - e.g. 
to pick out an artist name and/or music title from a music 
file name organised in one of different possible ways. 
These means can be used in conjunction with automat- 
ed systems that manage large numbers of audio files 
for creating music programs, compiling playlists, intelli- 
gently sorting, archiving, etc. In this way, the means of 
the invention form an interface between a collection of 
files named in a random manner and an intelligent file 
management system which requires precisely present- 
ed identification information. 

[0036] In this context, the invention can find applica- 
tions in a comprehensive management system provid- 
ing the following functionalities: 1) automatic recognition 
of title and artist identifiers from music file names, 2) au- 
tomatic classification of music titles using external 
sources of metadata (e.g. genre/subgenre), 3) mecha- 
nisms for handling all possible listening situations/be- 
haviours ranging from focussed (e.g. subgenre) to open/ 
exploratory modes, and 4) a facility for exchanging
user-specific categories through global servers or
peer-to-peer communications systems.
[0037] The invention has applications in end user soft- 
ware for PC, Interactive Digital Television (IDTV) via set- 
top boxes or integrated TVs, internet music servers, and 
Electronic Music Delivery services in general. 
[0038] The invention and its advantages shall be
more clearly understood upon reading the following
detailed description of embodiments, given purely
as non-limiting examples, in conjunction with the
appended drawings, in which:

- figure 1 is a simplified block diagram showing a pos- 
sible application of an information identification de- 
vice in accordance with the invention in the context 
of music title and artist name extraction from music 
file name data; 

- figure 2 is a general flowchart showing a procedure 
used by the information identification device of fig- 
ure 1 to produce normalised formats of music title 
and artist names from music file name data; 

- figure 3 is a flowchart of an inferencing routine used 
in the procedure of figure 2; and 

- figure 4 is a general diagram showing the data flow 
in the system of figure 1. 

[0039] Figure 1 shows a typical system in which the 
invention can be integrated. The system 1 in this example
is centred around a personal computer (PC) 2 which
is here decomposed in terms of a CPU (central processing
unit) and internal management section 4, and one
or more hard disk(s) 6 containing music files. Interfacing
with the user is through a normal computer video monitor
5 and a keyboard with an associated screen pointing
and selecting device such as a mouse or trackball 7.
The music files are loaded and accessed by the internal 
management section 4 using standard techniques. 
These files are acquired from different possible audio 
input sources to which the PC can be connected. In the 
example, these include:

- internet servers 8 such as sites which sell music titles
online, which generally allow music files to be
downloaded complete with the file name attributed
by the provider. To this end, the PC access and storage
unit 2 is equipped with a modem or other suitable
interface and the appropriate internet software
to establish the connections required;
- broadcast music from radio or TV stations 10. The
stations in question can be internet radio, cable,
satellite, AM or FM stations; and
- recorded media players 12, such as compact disk
or tape players, for transferring pre-recorded music
into the hard disk 6.

[0040] With the last two sources 10, 12, the music is
generally not presented in the form of a music file (except
in the case of a CD-ROM or the like). The recorded
sound is thus processed by appropriate software within
the PC 2 in accordance with a given compression protocol
(mp3, wav, etc.) and given a file name by the user
prior to storing in the hard disk(s) 6.
[0041] Operating in conjunction with the PC access
and storage unit 2 are four functionally separated modules:

- a music file identifier 14, which constitutes an embodiment
of the invention. Its main task in this example
is to identify and reformat automatically both
the artist name and the title name contained in a
given music file name;
- a musical category generator 16, which is a software
tool for sorting and cataloguing musical items
in terms of genres and/or sub-genres, or other criteria.
These are either already contained in the form
of metadata incorporated in a music file, or entered
manually by the system user;
- a music playlist generator 18, which is a tool for
building playlists, i.e. ordered sequences of musical
items, on the basis of users' tastes, statistical analyses
on previously recorded sequences, and a host
of other selection criteria. An example of such a music
playlist generator is described in copending European
patent application EP 00 403 556.4 by the
present Applicant. Basically, the music playlist generator
18 exploits information analysed in the musical
category generator 16 to produce music programs,
i.e. sequences of music titles, based on:

the user profile

similarity relations

the degree of novelty desired.

[0042] For a detailed description of that particular music
playlist generation system, reference can be made
to European Patent Application No. 00 403 556.4, filed
on December 15, 2000, and

- a client/server interface 20, which functionally links
all the elements 2-18 mentioned above to provide
the user with an integrated set of inputs and outputs
through an interactive software interface. The latter
appears in the form of menu pages and icons with
on-screen pushbuttons and cursors displayed on
the monitor 5.

[0043] All the components of system 1 are interconnected
for exchanging commands and data through a
shared two-way communications system 22. Depending
on the implementation of the apparatus 1, the communications
system 22 can be a physical bus linking the
different component units 2-20, or more generally a data
exchange protocol in a software based configuration. In
a typical embodiment, the music file identifier 14, musical
category generator 16, music playlist generator 18, and
client/server interface 20 are in the form of software or
firmware modules incorporated within a PC or in one or
several boxes connected to the latter.

[0044] The remainder of the description shall focus on
the music file identifier 14, the other components being
known in themselves and outside the core of the present
invention.

[0045] The task of the music file identifier 14 is to help
build a database of music titles automatically from a set 
of files randomly located on a personal storage system, 
such as the hard disk(s) 6. The database in question 
comprises identification information related to a set of 
music files located on the storage medium. 
[0046] Music files of various types may be present on 
the medium (e.g. wav, mp3, Atrac3, etc.). The main task 
in this context is to assign an artist and title identification 
to each of these files. 

[0047] This task involves on the one hand obtaining 
the basic artist/name identification from the file name 
and on the other unambiguously identifying the artist 
and title information (i.e. coping with ambiguities, typos 
and errors in general). 

[0048] The complexity of the task of interpreting vari- 
ous syntaxes shall be illustrated by examples of possi- 
ble file names for the music title from The Beatles enti- 
tled "Eleanor Rigby" (from the album "Revolver"): 

- The Beatles - Eleanor Rigby.mp3
- eleanor rigby; the beatles.mp3
- The Beatles - Revolver - Eleanor Rigby.MP3
- The Beatles - Eleanor Rigby - Revolver - Track 2.mp3
- Eleanor Rigby - Beatles, The.mp3
- Eleanor Rigby - Beatles, The,mp3
- etc.

[0049] In the simplest case, these two items of infor- 
mation (artist name and title name) are located in the 
audio file itself, for instance through so-called "ID tags" 
in an mp3 file. However, ID tags are not standardised 
and in many instances music files do not contain this 
information. The only way to obtain it therefore is 
through an analysis of actual file names. Moreover, even 
when ID tags are not empty, they may contain errors or 
ambiguities. 

[0050] The main problem to solve in this case is to 
guess the syntax of the file name so as to extract there- 
from the artist and name information, whenever possi- 
ble. 

[0051] To this end, the Applicant conducted a large- 
scale analysis of existing music file names (on individual 
hard disks, playlists, and databases such as CDDB), 
and determined a set of heuristic rules through which 
the required information could be inferred. From these 
heuristic rules can be developed machine interpretable 
recognition rules for implementing the identification 
task. 

[0052] Because music files are usually grouped in directory
structures on storage systems, the problem was
reduced to identifying sets of related music files rather
than individual files. Considering sets of files as opposed
to individual files makes it possible to deduce automatically
valuable information on the file name syntax.
[0053] This set of heuristic rules can be turned into a
process which

- takes as input:

  - a set of music file names, typically corresponding
  to music files in a given directory or subdirectory
  structure, or to a CD playlist, e.g. as returned
  by the CDDB server,
  - a database of existing artists and titles. This database
  is typically located at an internet server.
  It can be partially present, i.e. only a database
  of artist names, or even totally absent,

- and yields as output:

  - for each file name, the most probable artist and
  title information, and (possibly)
  - an update of the artist and title database.

[0054] In the preferred embodiment, the process involves
executing a sequence of tasks - indicated below
by respective numbered paragraphs - which take the
form of modules. These shall be described with reference
to the flow charts of figures 2 and 3.

[0055] The process begins by loading a file name into
the music file identifier 14 (step S1). In the course of the
process, a number of file names - preferably as many
as possible - shall be processed. As shall appear further,
these file names are treated both sequentially and collectively.
Collective processing is used when dealing
with samples for statistical analysis, e.g. for inferring artist
and title name ordering (cf. figure 3). For optimising
collective processing, the files considered shall preferably
be extracted from a common source: a same pre-recorded
medium (e.g. a CD), a same collection, and
more generally from a same directory of the hard disk(s) 6
insofar as it can be assumed that the division into
directories reflects some commonness in the audio file
name attribution.

[0056] Task module 1): normalise the file name (step
S2).

[0057] This involves setting the file name into a standardised
typographical form in preparation for the subsequent
task modules. The normalisation does not in itself
alter or extract data from the file name.

[0058] In the example, file name standardisation involves
performing the following tasks:

1.1. - remove trimming spaces, i.e. blank characters
which may be present at the start and/or end of a
string of characters

1.2. - convert to upper case

1.3. - remove all non-digit, non-letter and non-separator
characters, and replace them by a filler character
("_" in the example that follows).
[0059] For instance, by applying rules 1.1 to 1.3, a file
name such as " Eleanor Rigby - The Beatles - Revolver
@ track 3.mp3 " would become "ELEANOR RIGBY -
THE BEATLES - REVOLVER_TRACK3.MP3".
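
As a non-authoritative illustration, rules 1.1 to 1.3 might be sketched in Python as follows. The contents of SEPARATORS, the ASCII-only letter test and the choice to absorb blanks surrounding a replaced character are assumptions of this sketch, so the spacing of its output can differ from that of the example given above.

```python
import re

# Assumed set of separator characters; the description does not list them all.
SEPARATORS = "-_.;,"


def normalise(file_name, filler="_"):
    """Apply rules 1.1 to 1.3 to a single file name (illustrative only)."""
    name = file_name.strip()   # rule 1.1: remove trimming spaces
    name = name.upper()        # rule 1.2: convert to upper case
    # Rule 1.3: any run of characters that is neither an ASCII letter,
    # a digit, a blank nor a separator is replaced by the filler,
    # together with the blanks around it (assumption of this sketch).
    allowed = "A-Z0-9 " + re.escape(SEPARATORS)
    pattern = r"\s*[^" + allowed + r"]+\s*"
    return re.sub(pattern, filler, name)
```

Accented letters would be treated as non-letters by this sketch; a fuller implementation would need a wider definition of "letter".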
[0060] Once the file name has been normalised, it is 
stored in a normalised name memory for future refer- 
ence during the course of the procedure (step S3). 
[0061] The process then continues by seeking wheth- 
er a new file name is to be processed as above (step 
S4). It thus follows a return loop (L1) back to step S1 
until all the file names forming a set to be considered 
have been normalised and stored. These normalised file 
names thus obtained are stored in the normalised mem- 
ory for future reference. 

[0062] There is then extracted from the normalised 
name memory the first normalised file name (step S5) 
for processing by the next task module 2) below. 
[0063] Task module 2): check if the artist and name 
information are located in an ID tag in the corresponding 
music file itself (step S6). 

[0064] This task serves to determine whether it is nec- 
essary to infer the artist and title names from the file 
name. This is clearly unnecessary if that information is 
readily obtainable at the level of the ID tag contained as 
metadata in the music file. 

[0065] If the information is obtainable from the ID tag, 
the process moves to a step S7 of extracting the artist 
and title names from the ID tag and of storing them in 
corresponding memory registers for subsequent refer- 
ence. 

[0066] Otherwise (ID tag not present), the process 
moves from step S6 to an inferencing routine R1 (task 
module 3). 

[0067] Task module 3): order inferencing routine (R1). 
[0068] This task is executed when there is no ID tag 
to exploit. It comprises a self-contained routine compris- 
ing a series of steps indicated in figure 3. The routine 
rests on the assumption that although the syntax of the 
file names is unknown (for instance it can be ordered in 
terms of artist followed by title name, or title name fol- 
lowed by artist, or other), it is going to be the same for 
all files in the directory. The subtasks of the order infer- 
encing are: 

3.1. Infer the main separator character (step S8).

[0069] The separator character can be for instance
"-", "_", or any character belonging to a list of separator
characters established to that effect, designated
SEPARATOR_SET and stored in an appropriate memory
location. The latter is a set of all known separator
characters susceptible of being used in a file name.
[0070] The inference is performed by computing the
intersection of all characters for all files analysed, and
retaining only those characters which are:

- neither digits, nor letters, nor spaces,
- present the same number of times in all file names
excluding starting and ending positions.

[0071] To this end, inferencing step S8 involves analysing
some, or preferably all, of the normalised file
names stored at step S3: the larger the sample, the
more reliable the inferencing.
[0072] For instance, in the case of "ELEANOR RIGBY 

10 - THE BEATLES - REVOLVER_TRACK3.MP3", the 
separator would be , provided it is confirmed that this 
separator character is present for all files in the consid- 
ered set of normalised file names. 
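
Purely by way of illustration, the inference of step S8 can be sketched as follows in Python. The function name infer_separators and the sample file names are hypothetical and do not appear in the embodiment. Stripping the file extension before the analysis is likewise an assumption, made so that the "." of ".MP3" is not reported as a separator.

```python
import os
from collections import Counter


def infer_separators(file_names):
    """Return the candidate separator characters shared by every file name.

    A candidate is neither a digit, a letter nor a space, and occurs the
    same number of times in every name, the first and last positions of
    each name being ignored.
    """
    candidates = None
    for name in file_names:
        stem = os.path.splitext(name)[0]   # assumption: the extension is ignored
        inner = stem[1:-1]                 # exclude starting and ending positions
        counts = Counter(c for c in inner if not c.isalnum() and not c.isspace())
        pairs = set(counts.items())        # (character, number of occurrences)
        candidates = pairs if candidates is None else candidates & pairs
    return sorted(char for char, _ in (candidates or set()))
```

If every file of a set followed the pattern of the example of paragraph [0072], both "-" (twice per name) and "_" (once per name) would satisfy these criteria. Choosing the main separator among several candidates is left open in this sketch.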

3.2. Infer constant parts (step S9).

[0073] File names may contain constant parts, usually album names,
possibly augmented with track names. This means that each file name may
contain a part of the form "constant+variable", separated from the rest by
a separator. Here, the terms "constant" and "variable" are taken to mean
respectively constant throughout all or a determined proportion of the
analysed normalised file names and variable from one normalised file name
to another.

[0074] For instance, the file name "ELEANOR RIGBY - THE BEATLES -
REVOLVER_TRACK3.MP3" has a constant part of "REVOLVER_TRACK" followed by
an integer variable "3".

[0075] The "constant" part of the file name can be identified by standard
character string comparison techniques, on the basis that a character
string delimited by a separator is found to recur among the analysed
normalised file names. In the above example, two such constant parts could
possibly be identified: "THE BEATLES" and "REVOLVER_TRACK". However, only
the latter is followed by the above-mentioned variable. The constant part
"REVOLVER_TRACK" is then selected as the one to take into account in that
step simply by checking for the presence of a variable character following
these two candidate strings.

[0076] Once identified, if present, this constant part is removed from the
normalised file names together with its following variable (step S9). For
instance, the preceding file name would become: "ELEANOR RIGBY - THE
BEATLES.MP3". From that point on, it can be assumed that the resulting
file names are in one of two forms (excluding the extension):

"artist + separator + title", or
"title + separator + artist".
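
As a minimal sketch of step S9, and not as the embodiment's actual implementation, the removal of a "constant + integer" field could be written as follows in Python. The names strip_constant_part, constant_prefix and the parameter min_share are hypothetical. The sketch assumes that the separator has already been inferred, and that the variable part is a trailing integer.

```python
import os
import re
from collections import Counter

# A field made of a constant text followed by a trailing integer,
# e.g. "REVOLVER_TRACK3" gives the constant "REVOLVER_TRACK".
TRAILING_INT = re.compile(r"^(.*\D)(\d+)$")


def constant_prefix(field):
    """Return the text preceding a trailing integer, or None if there is none."""
    match = TRAILING_INT.match(field)
    return match.group(1) if match else None


def strip_constant_part(file_names, separator="-", min_share=1.0):
    """Remove "constant + integer" fields recurring in at least min_share of the names."""
    parsed = []
    prefix_counts = Counter()
    for name in file_names:
        stem, ext = os.path.splitext(name)
        fields = [f.strip() for f in stem.split(separator)]
        parsed.append((fields, ext))
        # Count each candidate constant once per file name.
        prefix_counts.update({p for p in map(constant_prefix, fields) if p})
    needed = min_share * len(file_names)
    constants = {p for p, n in prefix_counts.items() if n >= needed}
    cleaned = []
    for fields, ext in parsed:
        kept = [f for f in fields if constant_prefix(f) not in constants]
        cleaned.append(f" {separator} ".join(kept) + ext)
    return cleaned
```

A recurring field such as "THE BEATLES" is left untouched because it is not followed by a variable integer, which mirrors the selection criterion described above.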

3.3. Infer artist/title ordering (step S10) 

[0077] Here, each title is considered to possess two types of information,
designated "column 1" and "column 2". The task is then to infer whether
column 1 corresponds to the artist or title name. Knowing which column is
the artist, it can be deduced that the other one is the title, and vice
versa.

[0078] To infer which is the artist column, the processor 2 is made to
execute recognition rules, which are algorithms constructed from
respective heuristic rules. The heuristic rules are deduced at an initial
stage from the general characteristics of the type of information to be
identified. In the example where the type of information includes an
artist name in a data sequence containing in some undetermined form both
the artist and the music title names, the heuristic rules (HR) used are:

HR1 : artist names are less numerous than title 
names. A given artist usually produces more than 
one title. As a separate consideration, virtually any 
short sentence can be a title name. Therefore, it is 
a realistic goal to build a database of all artist 
names. Such a database can be contained internal- 
ly within a given memory portion of the system and/ 
or outside the system, e.g. from an online provider 
or server through an internet/intranet connection. 
HR2 : artist names are more redundant than title 
names. For instance, it is frequent that in a given 
directory, two or more audio files are from the same 
artist. Applying this heuristic rule thus involves com- 
paring names in column 1 and likewise those ap- 
pearing in column 2. A repetition of a same name 
in one of column 1 or column 2 is then taken as an 
indication that the repeated name could indeed cor- 
respond to an artist name, and that the column in 
which this repetition has occurred is occupied by 
artist names. 

HR3: artist names contain, on average, fewer words than title names. For
instance, typical artist names are "Supertramp" or "Rossini" (1 word), or
"The Beatles" (2 words), whereas typical title names are "Breakfast in
America" (3 words), "The Italian Girl in Algiers" (5 words), "I Wanna Hold
Your Hand" (5 words), etc. Of course, there are numerous exceptions, i.e.
artist names longer than title names. However, the Applicant has
discovered that on average, these exceptions are compensated in a given
set of files. This heuristic rule can be performed by counting the words
contained in each of the names appearing in column 1 and column 2 over the
set of file names stored at step S3, and deducing that the column for
which the number of words is the least contains the artist names (or, by
corollary, deducing that the column for which the number of words is the
most contains the music title names).
HR4: in most cases, artist names appear before titles.

[0079] The inferencing routine can apply some or all 
of these heuristic rules. Where more than one heuristic 
rule is applied, a hierarchy can be established, whereby 
the routine is interrupted as soon as a meaningful result 
is obtained from one of the rules. 
[0080] In the example, all four heuristic rules are programmed for
execution in the order HR1, HR2, HR3 and HR4. Each of these rules is
expressed as a respective recognition rule which directs the processor 2
to execute tasks on the normalised file names. These tasks are aimed at
deriving a true/false response to an induced assumption that a designated
one of column 1 and column 2 names satisfies the corresponding heuristic
rule, i.e. corresponds to an artist name.
[0081] For instance, heuristic rule HR3 has a corresponding recognition
rule which is implemented by determining through counting tasks whether it
is true or false that column 1 names, say, contain more words than column
2 names.

[0082] Note that heuristic rule HR4 is a default attribution for which the
recognition rule simply involves the task of forcing a true response for
column 1 names.
[0083] To implement these heuristic rules HR1 to 
HR4, the embodiment executes the following sequence 
of procedures: 

[0084] Procedure ARTIST_IS_FIRST (COLUMN1, COLUMN2) Returns a BOOLEAN:

1) "Look for known artists" (heuristic rule HR1)

Given an existing database of artists, compute OCC1, the number of
occurrences of column 1 names which are in the artist database. Likewise,
compute OCC2, the number of occurrences of column 2 names which are in the
artist database.
Note: to check that a given string of characters is included in the artist
or title database, the procedure does not perform a simple string matching
(i.e. character-by-character), because the column 1/column 2 names may be
subject to some errors, as mentioned above. Instead, the procedure
described below (checking entries in artist or title databases) is used.

If OCC1 > 0 and (OCC2 = 0) then return TRUE (i.e. COLUMN1 is ARTIST)
If OCC1 = 0 and (OCC2 > 0) then return FALSE (i.e. COLUMN2 is ARTIST)

2) "Look for repeating artists" (heuristic rule HR2)

Compute OCC1, the number of different items for column 1. Likewise,
compute OCC2, the number of different items for column 2.

If OCC1 > OCC2 then return FALSE
If OCC2 > OCC1 then return TRUE

3) "Look for average number of words" (heuristic rule HR3)

Compute OCC1, the total number of words in items of column 1. Likewise,
compute OCC2, the total number of words in items of column 2.

If OCC1 > OCC2 then return TRUE
If OCC2 > OCC1 then return FALSE

4) "By default, artists are first" (heuristic rule HR4)

return TRUE
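
The cascade of the four heuristic rules can be illustrated by the following Python sketch, given under stated assumptions rather than as the embodiment's code. The function name artist_is_first and the parameter known_artists are hypothetical. The database lookup is reduced here to exact matching, whereas the procedure calls for a tolerant comparison. For step 3, the sketch follows the wording of heuristic rule HR3, namely that the column with the fewer words is taken to hold the artist names; the procedure as printed states this comparison the other way round, so the direction chosen here is an assumption.

```python
def artist_is_first(column1, column2, known_artists):
    """Return True if column 1 is inferred to hold the artist names."""
    # 1) HR1: look for known artists (exact lookup, a simplifying assumption).
    occ1 = sum(name in known_artists for name in column1)
    occ2 = sum(name in known_artists for name in column2)
    if occ1 > 0 and occ2 == 0:
        return True
    if occ1 == 0 and occ2 > 0:
        return False

    # 2) HR2: the column with fewer different items repeats more, hence artists.
    occ1, occ2 = len(set(column1)), len(set(column2))
    if occ1 > occ2:
        return False
    if occ2 > occ1:
        return True

    # 3) HR3: fewer words in total suggests artist names (direction assumed
    #    from the statement of HR3, see the accompanying text).
    occ1 = sum(len(name.split()) for name in column1)
    occ2 = sum(len(name.split()) for name in column2)
    if occ1 < occ2:
        return True
    if occ2 < occ1:
        return False

    # 4) HR4: by default, artists are first.
    return True
```

Each rule returns as soon as it yields a decision, so that a later rule is consulted only when the earlier ones are inconclusive.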

[0085] The thus-inferred artist names and music title names are stored in
respective registers for future reference in the remainder of the music
file identification procedure (step S11).

[0086] The order inferencing routine R1 is then termi- 
nated. 

[0087] Once these heuristic rules have been applied, there is then
performed the task of checking artist and title names written in the
respective registers against entries in a database. This database can be
the one used in heuristic rule HR1 for the artist name, coupled to a
similar data base of music title names, also accessed from the internal
memory or through a provider via internet/intranet.

[0088] This task makes it possible, among other things, to check for
possible mistakes, typos and errors in general in artist or title
character strings.

[0089] To do so, use is made of a separate database 
containing rewriting rules. These rules are applied sys- 
tematically to an artist or title information (obtained from 
the preceding module), and transform that information 
to yield a "canonical form" (step S12, figure 2). It is this 
canonical form which is checked against the corre- 
sponding canonical form of entries in the artist/music ti- 
tle name database. 

[0090] For artist names, the artist rewriting rules (ARR) are the
following:

ARR1: name, The -> The name (i.e. definite article placed before the name,
and intervening comma removed).
ARR2: name, Les -> Les name (for French groups)
ARR3: name, firstname -> firstname name, where firstname belongs to a
FIRSTNAME_DATABASE. The latter is simply a stored list of possible first
names against which the variable "firstname" is checked.
ARR4: Name1 (name2) -> Name1 (i.e. any reference placed in parentheses
after Name1 is removed. For example, "Yesterday (stereo version)" would
become "Yesterday", likewise "Yesterday (mono mix)" would become
"Yesterday", etc.)
ARR5: Any space character is removed.
ARR6: All accentuated characters are replaced by their non-accentuated
equivalents (e.g. "é" is replaced by "e").

[0091] Other rewriting rules can be envisaged, e.g. to process titles in
different languages.
[0092] By applying these rules, the following examples of transformations
are produced: "BEATLES, THE" -> "THEBEATLES"; "FRANÇOISE HARDY" ->
"FRANCOISEHARDY".

[0093] For titles, the title rewriting rules (TRR) are the following:

TRR1: Name1 (name2) -> Name1
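
The rewriting rules lend themselves to a compact illustration. The following Python sketch is one possible reading of rules ARR1 to ARR6 and TRR1, not the embodiment's implementation. The function names, the sample contents of FIRSTNAME_DATABASE and the order in which the rules are applied are assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical sample of the stored list of first names (accents removed).
FIRSTNAME_DATABASE = {"FRANCOISE", "JOHN", "MARIAH"}

TRAILING_PARENS = re.compile(r"\s*\([^)]*\)\s*$")


def strip_accents(text):
    """ARR6: replace accentuated characters by their non-accentuated equivalents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def rewrite_artist(name):
    """Return the canonical form of an artist name."""
    name = TRAILING_PARENS.sub("", name.strip())            # ARR4
    match = re.match(r"^(.*?),\s*(.+)$", name)
    if match:
        last, first = match.group(1).strip(), match.group(2).strip()
        is_article = first.upper() in {"THE", "LES"}          # ARR1, ARR2
        is_firstname = strip_accents(first).upper() in FIRSTNAME_DATABASE  # ARR3
        if is_article or is_firstname:
            name = f"{first} {last}"
    name = "".join(name.split())                              # ARR5
    return strip_accents(name)                                # ARR6


def rewrite_title(title):
    """TRR1: drop any reference placed in parentheses after the title."""
    return TRAILING_PARENS.sub("", title.strip())
```

Applied to the examples given above, rewrite_artist turns "BEATLES, THE" into "THEBEATLES", and rewrite_title turns "Yesterday (stereo version)" into "Yesterday".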

[0094] Once the names have been rewritten according to a standardised
format, a spell check is made on the thus-determined artist name. This
check involves comparing the characters that form the detected artist name
against a database list of known, correctly spelt artist names, and
checking for a match. In the case where no match is found, a routine is
initiated to determine whether the checked name is similar in form to an
artist name in the data base, e.g. whether a double letter has been
omitted, a character inversion has occurred, or a syllable has been
incorrectly spelt. If such is the case, then the correctly spelt artist
name is automatically inserted in place of the incorrectly spelt name. The
techniques for identifying such possible typos and automatically finding
and replacing with the appropriate word or name are well known in the
field of spelling checkers for word processing software and the like. If
no similarity is found by the spell checker, then it is assumed that the
artist name is new for the database and that name is simply left as it is.
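
By way of a hedged example, the check described above can be approximated in Python with the standard difflib module, which is used here in place of whatever spelling checker technique the embodiment relies on. The function name spell_check_artist, the similarity threshold of 0.85 and the sample list of known artists are assumptions made for the illustration.

```python
import difflib


def spell_check_artist(name, known_artists, cutoff=0.85):
    """Return the known spelling closest to name, or name itself if none is close.

    known_artists is a list of canonical, correctly spelt artist names.
    """
    if name in known_artists:
        return name                      # exact match, nothing to correct
    close = difflib.get_close_matches(name, known_artists, n=1, cutoff=cutoff)
    if close:
        return close[0]                  # similar in form: insert the known spelling
    return name                          # assumed new for the database, left as it is


KNOWN_ARTISTS = ["THEBEATLES", "SUPERTRAMP", "ROSSINI"]  # hypothetical sample
```

With this sample list, "SUPERTAMP" (a missing letter) is replaced by "SUPERTRAMP", whereas a name with no close counterpart is returned unchanged.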

[0095] A similar check can also be made on the music 
title, using analogous techniques. The corresponding 
music title data base would normally need to be updated 
more regularly. However, use can be made of the fact 
that music titles are normally composed of words which 
exist individually in spell check dictionaries, especially 
if the latter also contain proper nouns. 
[0096] The final output of this module is artist and title 
information for each file name in a directory. 
[0097] In the embodiment, the remainder of the procedure is dedicated to
the task of preparing the identified and reformatted artist and title name
information for future use by the musical category generator 16 or music
playlist generator 18.

[0098] Once artist and title information is obtained for music files,
there is associated to each music file a set of musical metadata (step
S13). In general, these metadata can be any descriptor associated to
either an artist or a title. They can come from the musical category
generator 16, a database (internal or external) or from information
contained in the actual music file corresponding to the file name in
question. The complete set of items of information associated to a music
file, i.e. the rewritten artist name, rewritten music title name and
metadata, is stored within or outside the system (step S14) such that it
can be later accessed for exploitation by various possible applications,
such as the music playlist generator 18.

[0099] Once a file name has been processed, the procedure proceeds to
check whether another normalised file name is to be processed (step S15).
If a new normalised file name is to be processed, the process returns to
the step S6 of extracting the next normalised file name and continues from
that point on.
[0100] Once all the normalised file names have been processed, the
procedure branches off from step S15 to its end point.

[0101] In a variant where it is not envisaged to associate metadata with a
file name, the conditional branching step S15, which checks whether a new
file is to be processed, is simply implemented just after the step S12 of
applying the rewriting rules.

[0102] The remainder of the description shall focus on examples of how the
extracted and rewritten artist and music title names can be exploited.
Here, these items of information are associated with one particularly
useful item of metadata: genre/subgenre information. A detailed
description of how genre/subgenre information is exploited for the
extraction and representation of musical metadata from an audio signal, or
for rhythm extraction is given in European patent application EP-A-00 400
948.6, filed on April 6 2000 by the present Applicant.
[0103] In the embodiment, the genre category is a 
simple two-level hierarchical term. 
[0104] At the first level, there is the field GENRES, 
such as "Classical", "Jazz", or "Pop". For each genre, 
there can exist a series of SUBGENRE fields. For in- 
stance, the "Jazz" genre may contain subgenres such 
as "BEBOP", "COOL", "SWING", "BIGBANDS" or even 
"JAZZ GUITAR", etc. 

[0105] For each artist, the ARTISTGENRES database contains one or several
entries, corresponding to the genre or subgenre the artist usually belongs
to. For instance, the database may contain the following entries:

MARIAH CAREY        POP/POP SONG
THE BEATLES         POP/POP SONG
JOHN MC LAUGHLIN    JAZZ/GUITAR
JOHN MC LAUGHLIN    JAZZ/FUSION
RAMEAU              CLASSICAL/BAROQUE
.../...

where "/" indicates the genre/subgenre relationship.
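
As an illustration only, entries of this kind could be held in memory as a mapping from each artist to a list of genre and subgenre pairs. In the Python sketch below, the names ENTRIES and build_artist_genres are hypothetical, and the in-memory representation is an assumption; the description does not specify how the ARTISTGENRES database is stored.

```python
from collections import defaultdict

# Sample entries taken from the example above.
ENTRIES = [
    ("MARIAH CAREY", "POP/POP SONG"),
    ("THE BEATLES", "POP/POP SONG"),
    ("JOHN MC LAUGHLIN", "JAZZ/GUITAR"),
    ("JOHN MC LAUGHLIN", "JAZZ/FUSION"),
    ("RAMEAU", "CLASSICAL/BAROQUE"),
]


def build_artist_genres(entries):
    """Group "GENRE/SUBGENRE" entries by artist, keeping several entries per artist."""
    table = defaultdict(list)
    for artist, path in entries:
        genre, _, subgenre = path.partition("/")   # split on the first "/"
        table[artist].append((genre, subgenre or None))
    return dict(table)
```

An artist listed several times, such as JOHN MC LAUGHLIN, simply receives several genre and subgenre pairs.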

[0106] Genre and subgenre are very useful items of 
information when thinking about music. However, genre 
and subgenre categories are not always objective, and 
vary from one culture to another. It is therefore proposed 
to proceed in two steps. First, there is introduced an a 
priori, initial, genre/subgenre categorisation system, 
which contains about 10 000 artists. Secondly, users 
can themselves submit genre/subgenre categories as 
they wish, using an updating mechanism described be- 
low. 

[0107] At the level of the music playlist generator 18, the rewritten
artist names and rewritten music title names extracted automatically by
the music file identifier 14 can also be exploited to create intelligent
music compilations. More particularly, a playlist compiled will produce an
ordered sequence of music titles to be accessed for play on a sound
reproduction system or the like. The character format of music title names
(and possibly artist names) in the playlist is standardised at the level
of the playlist generator's database. The rewritten artist and music title
names produced by music file identifier 14 according to the invention have
this same format. Accordingly, the system 1 can exploit a playlist
produced by the playlist generator 18 directly to access the corresponding
music files, the file names of the latter having been appropriately
rewritten and formatted to correspond to the items of the playlist.

[0108] The playlist generator 18 can be made to produce personalised
sequences according to music categories (genre/subgenre), and following an
analysis of different inputted "seed" playlists, e.g. from radio stations,
album compilations, downloaded personal anthologies, etc. This analysis
looks not only for a commonness in the genre/subgenre, but also for the
closeness of music titles in "seed" playlists, so that if it appears that
two different titles are found on several instances to appear close to (or
next to) each other, the playlist generator shall tend to maintain that
neighbourhood relationship in its outputted personalised playlist.
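
The closeness analysis of seed playlists is not detailed in the present description. As a rough sketch under that caveat, closeness can be approximated in Python by counting how often two titles appear within a given distance of each other. The function name neighbour_counts and the parameter window are hypothetical.

```python
from collections import Counter


def neighbour_counts(seed_playlists, window=1):
    """Count, for each unordered pair of titles, how often they appear close together.

    Two titles are treated as close when at most `window` positions apart
    in a seed playlist (window=1 means directly next to each other).
    """
    counts = Counter()
    for playlist in seed_playlists:
        for i, title in enumerate(playlist):
            for other in playlist[i + 1 : i + 1 + window]:
                if other != title:
                    counts[frozenset((title, other))] += 1
    return counts
```

Pairs with a high count are those whose neighbourhood relationship a generator of this kind would tend to preserve.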
[0109] The playlist/music program generator 18 can also be endowed with a
controllable "musical excursion" function, which produces occasional
departures from an imposed genre or category so as to allow the user to
discover other types of music. The excursions nevertheless tend to follow
relationships established in the seed playlists, so that a music title
corresponding to such an excursion shall be placed near (or next to) a
music title within the requested genre or category if the two titles in
question have been observed as placed together in one or more seed
playlists. The degree of excursion in the personalised playlist is user
variable (e.g. through an on-screen slider) from zero discovery to maximum
discovery.

[0110] The sequences of music titles thus produced by the playlist
generator can thus be based on:

- the user profile,
- similarity relations,
- the degree of novelty desired.

[0111] The system 1 can thus allow updating of user specific musical
information. To allow dynamic updating of title metadata for genre, there
is further introduced an updating mechanism. This mechanism simply allows
users to "post" genre/subgenre information for artists. This allows the
user to post and exchange genres dynamically, for instance to create or
foster communities with specific music tastes (e.g. hip-hop fans could
create a generic hip-hop profile from which interested newcomers could
inherit to be able, right away, to create music programs in this style).

[0112] Moreover, the system can be extended to include an updating
mechanism for TITLE SIMILARITY. This consists simply in allowing users to
post their playlists (generated by the system or by any other means). Each
playlist posted by a user is then added to the pool of playlists used by a
similarity extractor fully described in European patent application
EP-A-00 403 556.4 supra from the present Applicant.
[0113] Figure 4 illustrates schematically the general organisation of
information construction in the system 1 of figure 1. The music files
stored in the hard disk(s) 6 are accessed for reading their file names by
the music file identifier 14. From each accessed filename is obtained the
corresponding artist name and title name, using the procedure described
above in conjunction with figures 2 and 3.

[0114] The artist name and title name information is inputted to the
musical category generator 16, where it is matched respectively against an
ARTISTGENRE database 30 and a TITLESIMILARITY database 32. From these data
bases are extracted the appropriate metadata.

[0115] The system 1 then constructs for each proc- 
essed file name a data module composed of the follow- 
ing information items : artist, title, genre, and similarity 
(box 34). 

[0116] The resulting data modules are supplied in 
suitable format to the playlist generator 18 where they 
serve as a data source for the generation of playlists 
which can be exported to different units (audio playback 
systems, servers, etc.). 

[0117] There is provided a mechanism for updating 
information from the playlist generator 18 to the 
ARTISTGENRE database 30 and the TITLESIMILARI- 
TY database 32. Thus, a user may add or alter a genre 
or similarity relationship at the level of the playlist gen- 
erator and have this action automatically recorded at the 
level of these data bases 30, 32. 
[0118] The above are merely examples of applications for the music file
identifier 14 in accordance with the invention. It is clear that the
invention has many other possible applications, such as:

- generally reorganising file names automatically in various data bases to
  make file access amenable to automated protocols and procedures,
- presenting users with a uniform file name presentation (e.g. on screen
  or printout), to ease viewing and assimilation of contents. For
  instance, it is much more satisfying to see a list of music titles in a
  collection of several thousand titles in personal or commercial
  catalogue listings organised in the same artist/title order, with
  uniform typography,
- automatically sorting lists of titles, directories, etc.,
- automatically ordering missing titles from a server or other source,
- etc.

[0119] It will be noted that the rewriting rules established at step S12
are purely arbitrary and established in accordance with the applications
with which the music file identifier is to cooperate. Other rules can be
envisaged according to circumstances. Thus, rules giving a more
presentable format for human intelligibility can be applied for producing
on-screen displays or printouts.
[0120] Although the embodiments of the invention have been presented in
the context of music files, the invention has a much broader spectrum of
applications, and covers all situations where items of information are
stored in different possible ways and these items need to be extracted
automatically. Examples of other areas of application of the invention
are:

- managing computer file names, each corresponding to a text, spreadsheet,
  database file or the like. For instance, a number of persons using a
  computer system may each have their own way of naming the files they
  create, but generally in a manner which contains two or more items of
  information, e.g. among customer name or reference, internal reference,
  date, etc. A file identifier analogous to that described above can then
  be implemented to infer the items of information from the different file
  names so that they can be indexed appropriately in a centralised
  database;
- managing lists of grouped items of information in general, in which the
  grouping is not standardised. The lists in question can refer e.g. to
  publications with items of information corresponding to author (name and
  first name), title, editor, or indeed any other inventory,
- etc.

[0121] It is clear that there are many different ways of implementing the
invention, in terms of both hardware and software. A largely software
implementation can be envisaged, with heavy dependency on hardware
resources of an existing system, such as the personal computer (PC) of
figure 1. In this case, the different necessary algorithms would be
executed at the level of the PC's CPU 4, with intermediate results stored
in the PC's internal memory spaces.

[0122] A predominantly hardware implementation of the invention can also
be envisaged in the form of a complete stand-alone unit complete with its
own processor, memory, interfaces, display, input/output ports, etc.
[0123] Between these extremes, other intermediate forms of implementation
can be chosen arbitrarily.



Claims 



1. Method of automatically identifying in a set of data sequences at least
one specific type of information contained in each data sequence of the
set, wherein said type of information has an unknown presentation in said
data sequences, characterised in that it comprises the steps of:

- initially defining at least one characteristic feature of said specific
  type of information, and of expressing said characteristic feature(s) in
  terms of at least one recognition rule executable by processor means
  (2),
- applying said recognition rule(s) through said processor means to
  analyse said set of data sequences,
- determining in each data sequence a data portion thereof satisfying said
  recognition rule(s), and
- identifying said data portion as corresponding to said specific type of
  information.

2. Method according to claim 1, wherein said data sequence corresponds to
characters forming a file name of a computer file.

3. Method according to claim 1 or 2, wherein:

- said set of data sequences corresponds to a respective set of file names
  of music files, each data sequence being the characters forming a
  corresponding music file name, and a said data portion being a character
  field containing information of a given type, and
- said specific type of information to be identified comprises at least
  one of:
  - a first type of information corresponding to an artist name contained
    in said music file name, and
  - a second type of information corresponding to a music title name
    contained in said music file name.

4. Method according to claim 3, further comprising a step, prior to said
determining step, of determining a separator character present between
character fields respectively assigned to said first and second types of
information.

5. Method according to claim 3 or 4, further comprising a step of
detecting the presence of a character cluster composed of a first part
which is constant and a second part which is variable over said set of
music file names, said second part being e.g. an integer or equivalent
count character, and of eliminating that character cluster from said
character sequence.

6. Method according to any one of claims 3 to 5, wherein a said
recognition rule instructs to identify said first type of information as
contained in the character field forming the most words among character
fields assigned to respective types of information.

7. Method according to any one of claims 3 to 6, wherein a said
recognition rule instructs to identify said first type of information as
contained in the character field which has the most occurrence in
identical form in said set of music file names.

8. Method according to any one of claims 3 to 7, 
wherein a said recognition rule instructs to identify 
said first type of information as contained in the 
character field matching a character field in a set of 
stored character fields corresponding to artist 
names. 

9. Method according to any one of claims 3 to 8, 
wherein a said recognition rule instructs to identify 
said first type of information as contained in the first 
character field appearing in the music file name. 

10. Method according to any one of claims 3 to 9, wherein said determining
and identifying steps involve the sub-steps of:

- identifying in said characters forming said music file name a first
  character field and a second character field, one said field containing
  the first type of information (artist name) and the other containing the
  second type of information (music title name),
- determining, by reference to an artist database containing character
  fields each corresponding to a respective artist name, a first value
  (OCC1) corresponding to the number of occurrences, over said set of
  music file names, of a first character field contained in said artist
  database, and a second value (OCC2) corresponding to the number of
  occurrences, over said set of music file names, of a second character
  field contained in said artist database, wherein
- if said first value (OCC1) is greater than said second value (OCC2),
  identifying said first character field as corresponding to an artist
  name,
- if said second value (OCC2) is greater than said first value (OCC1),
  identifying said second character field as corresponding to an artist
  name,
- if said first and second values (OCC1, OCC2) are equal, continuing by:
  - determining a new first value (OCC1) corresponding to the number of
    different contents of said first character field over the set of music
    file names and a new second value (OCC2) corresponding to the number
    of different contents of said second character field over the set of
    music file names, wherein
  - if said first value (OCC1) is greater than said second value (OCC2),
    identifying said second character field as corresponding to an artist
    name,
  - if said second value (OCC2) is greater than said first value (OCC1),
    identifying said first character field as corresponding to an artist
    name,
  - if said first and second values (OCC1, OCC2) are equal, continuing by:
    - determining a new first value (OCC1) corresponding to the total
      number of words in said first character field summed over the entire
      set of music file names and a new second value (OCC2) corresponding
      to the total number of words in said second character field summed
      over the entire set of music file names, wherein
    - if said first value (OCC1) is greater than said second value (OCC2),
      identifying said first character field as corresponding to an artist
      name,
    - if said second value (OCC2) is greater than said first value (OCC1),
      identifying said second character field as corresponding to an
      artist name, and
    - if said first and second values (OCC1, OCC2) are equal, identifying
      said first character field as corresponding to an artist name.

11. Method according to any one of claims 3 to 10, further comprising the
step of applying rewriting rules to at least one of an artist name and a
music title name identified from a said music file name, said rewriting
rules being executable by said processor means (2) for transforming an
artist name/music title name into a form corresponding to that used for
storing artist names/music title names in a database.

12. Method according to claim 11, further comprising a step of compiling a
directory of rewritten music file names, corresponding to said identified
music file names, in which at least one of an artist name and a music
title name is organised to be machine readable.

13. Method according to any one of claims 3 to 12, further comprising the
step of constructing for each music file name a machine readable
information module comprising at least one of an identified artist name
and an identified music title name, to which is associated metadata, said
metadata being provided from a database on the basis of said identified
artist name and/or music title name.

14. Method according to claim 13, wherein said metadata is indicative of a
genre or genre/subgenre associated with the corresponding music title.

15. Use of the method according to any one of claims 
3 to 14 in a music playlist generator (16), wherein 
said playlist generator accesses stored music files 
by reference to identified artist names and/or iden- 
tified music title names. 

16. Apparatus (1, 12) for automatically identifying in a
set of data sequences at least one specific type of
information contained in each data sequence of the
set, wherein said type of information has an
unknown presentation in said data sequences,
characterised in that it comprises:

- means (5, 7) for expressing at least one
characteristic feature of said specific type of
information, and for expressing said characteristic
feature(s) in terms of at least one machine
executable recognition rule,
- processor means (2) for applying said
recognition rule(s) to analyse said set of data
sequences,
- determining means (2) for determining in each
data sequence a data portion thereof satisfying
said recognition rule(s), and
- identifying means for identifying said data
portion as corresponding to said specific type of
information.

17. Apparatus according to claim 16, wherein said data
sequence corresponds to characters forming a file 
name of a computer file. 

18. Apparatus according to claim 16 or 17, wherein : 

said set of data sequences corresponds to a re- 
spective set of file names of music files, each 
data sequence being the characters forming a 
corresponding music file name, and a said data 
portion being a character field containing infor- 
mation of a given type, and 
said specific type of information to be identified 
comprises at least one of : 

a first type of information corresponding to 
an artist name contained in said music file 
name, and 

a second type of information correspond- 
ing to a music title name contained in said 
music file name. 

19. Apparatus according to claim 18, further comprising
separator character means for detecting a separa- 
tor character present between character fields re- 
spectively assigned to said first and second types 
of information. 
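
The separator detection of claim 19 can be pictured with the short sketch below. It is an assumption-laden illustration: the claim does not list the separator characters or the detection criterion, so the candidate list and the "appears exactly once in every file name" test are hypothetical, as is the function name `detect_separator`.

```python
def detect_separator(names, candidates=(" - ", "-", "_", ",")):
    """Return the first candidate separator that splits every name into two fields.

    The candidate list and the exactly-once criterion are illustrative
    assumptions; None is returned when no candidate fits all names.
    """
    for sep in candidates:
        if all(name.count(sep) == 1 for name in names):
            return sep
    return None


names = ["Beatles - Help", "Queen - Bohemian Rhapsody"]
sep = detect_separator(names)
fields = [tuple(name.split(sep)) for name in names]
print(sep, fields)
```

Longer candidates are tried first so that " - " is preferred over a bare hyphen that may also occur inside a name.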






20. Apparatus according to claim 18 or 19, further
comprising means for detecting the presence of a
character cluster composed of a first part which is
constant and a second part which is variable over said
set of music file names, said second part being
e.g. an integer or equivalent count character, and for
eliminating that character cluster from said
character sequence.
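
A minimal sketch of the cluster elimination of claim 20 follows. It assumes, purely for illustration, that the cluster sits at the start of each file name and that the variable part is a decimal integer; the claim itself is not limited to that position or form. The names `CLUSTER` and `strip_counter_cluster` are hypothetical.

```python
import re

# Assumed cluster shape: a non-digit constant part followed by an integer,
# located at the start of the name, with optional trailing whitespace.
CLUSTER = re.compile(r"^(\D*?)(\d+)\s*")


def strip_counter_cluster(names):
    """Remove a leading 'constant + counter' cluster shared by all names.

    The cluster is removed only if every name has one, the constant part is
    identical across the set, and the integer part actually varies.
    """
    matches = [CLUSTER.match(name) for name in names]
    if not all(matches):
        return list(names)
    constants = {m.group(1) for m in matches}
    counters = {m.group(2) for m in matches}
    if len(constants) != 1 or len(counters) < 2:
        return list(names)
    return [name[m.end():] for name, m in zip(names, matches)]


print(strip_counter_cluster(["Track01 Beatles - Help",
                             "Track02 Beatles - Yesterday"]))
```

Requiring the integer part to vary avoids stripping a number that is genuinely part of every name in the set.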

21. Apparatus according to any one of claims 18 to 20,
wherein a said recognition rule instructs to identify
said first type of information as contained in at least
one of:

i) the character field forming the most words
among character fields assigned to respective
types of information,

ii) the character field which has the most
occurrences in identical form in said set of music file
names,

iii) the character field matching a character field
in a set of stored character fields corresponding
to artist names, and

iv) the first character field appearing in the
music file name.

22. Apparatus according to any one of claims 18 to 21,
further comprising:

- means for identifying in said characters forming
said music file name a first character field and
a second character field, one said field containing
the first type of information (artist name) and
the other containing the second type of
information (music title name),

- means for determining, by reference to an artist
database containing character fields each
corresponding to a respective artist name, a first
value (OCC1) corresponding to the number of
occurrences, over said set of music file names,
of a first character field contained in said artist
database, and a second value (OCC2)
corresponding to the number of occurrences, over
said set of music file names, of a second
character field contained in said artist database,
wherein

    - if said first value (OCC1) is greater than said
    second value (OCC2), said first character field
    is identified as corresponding to an artist name,

    - if said second value (OCC2) is greater than said
    first value (OCC1), said second character
    field is identified as corresponding to an artist
    name,

- means, operative if said first and second values
(OCC1, OCC2) are equal, for determining a
new first value (OCC1) corresponding to the
number of different contents of said first
character field over the set of music file names and
a new second value (OCC2) corresponding to
the number of different contents of said second
character field over the set of music file names,
wherein

    - if said first value (OCC1) is greater than
    said second value (OCC2), said second
    character field is identified as corresponding
    to an artist name, and

    - if said second value (OCC2) is greater than
    said first value (OCC1), said first
    character field is identified as corresponding to
    an artist name,

- means, operative if said first and second values
(OCC1, OCC2) are equal, for determining a
new first value (OCC1) corresponding to the
total number of words in said first character field
summed over the entire set of music file names
and a new second value (OCC2) corresponding
to the total number of words in said second
character field summed over the entire set of
music file names, wherein

    - if said first value (OCC1) is greater than
    said second value (OCC2), said first
    character field is identified as corresponding to an
    artist name, and

    - if said second value (OCC2) is greater than
    said first value (OCC1), said second
    character field is identified as
    corresponding to an artist name, and

- means, operative if said first and second values
(OCC1, OCC2) are equal, for identifying said
first character field as corresponding to an artist
name.
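
The decision cascade of claim 22 can be sketched as follows. This is an illustrative reading, not a definitive implementation: the function name `artist_field_index`, the representation of the set of file names as a list of (first field, second field) pairs, and the case-insensitive lookup in a set of known artist names are all assumptions.

```python
def artist_field_index(pairs, artist_db):
    """Return 0 if the first field is taken as the artist name, 1 if the second.

    pairs     -- list of (first_field, second_field) tuples, one per file name
    artist_db -- set of lower-cased known artist names (assumed representation)
    """
    first = [p[0] for p in pairs]
    second = [p[1] for p in pairs]

    # Stage 1: occurrences of each field in the artist database.
    occ1 = sum(1 for f in first if f.lower() in artist_db)
    occ2 = sum(1 for f in second if f.lower() in artist_db)
    if occ1 != occ2:
        return 0 if occ1 > occ2 else 1

    # Stage 2: number of different contents. The field with MORE distinct
    # values is taken as the title, so the other field is the artist.
    occ1, occ2 = len(set(first)), len(set(second))
    if occ1 != occ2:
        return 1 if occ1 > occ2 else 0

    # Stage 3: total number of words summed over the whole set.
    occ1 = sum(len(f.split()) for f in first)
    occ2 = sum(len(f.split()) for f in second)
    if occ1 != occ2:
        return 0 if occ1 > occ2 else 1

    # Final fallback: the first field is taken as the artist name.
    return 0


pairs = [("Beatles", "Yesterday"), ("Beatles", "Help"), ("Beatles", "Let It Be")]
print(artist_field_index(pairs, artist_db=set()))
```

In the usage example the artist database is empty, so the first stage ties and the second stage decides: the first field has one distinct value against three for the second, so the first field is taken as the artist.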

23. Apparatus according to any one of claims 18 to 22,
further comprising rewriting means for applying
rewriting rules to at least one of an artist name and a
music title name identified from a said music file
name, said rewriting rules being executable for
transforming an artist name/music title name into a
form corresponding to that used for storing artist
names/music title names in a database.

24. Apparatus according to claim 23, further comprising
compiling means for compiling a directory of
rewritten music file names, corresponding to said
identified music file names, in which at least one of an
artist name and a music title name is organised to
be machine readable.

25. Apparatus according to any one of claims 18 to 24,
further comprising constructing means for
constructing for each music file name a machine
readable information module comprising at least one of an
identified artist name and an identified music title
name, to which is associated metadata, said
metadata being provided from a database on the basis
of said identified artist name and/or music title
name.

26. Apparatus according to claim 25, wherein said 
metadata is indicative of a genre or genre/subgenre 
associated with the corresponding music title. 


27. System combining an apparatus according to any 
one of claims 16 to 26 with a music playlist
generator (16), wherein said playlist generator accesses
stored music files by reference to identified artist
names and/or identified music title names.